Goto

Collaborating Authors

 excessive risk



On Theoretical Interpretations of Concept-Based In-Context Learning

Tang, Huaze, Peng, Tianren, Huang, Shao-lun

arXiv.org Artificial Intelligence

In-Context Learning (ICL) has emerged as an important new paradigm in natural language processing and large language model (LLM) applications. However, the theoretical understanding of the ICL mechanism remains limited. This paper aims to investigate this issue by studying a particular ICL approach, called concept-based ICL (CB-ICL). In particular, we propose theoretical analyses on applying CB-ICL to ICL tasks, which explains why and when the CB-ICL performs well for predicting query labels in prompts with only a few demonstrations. In addition, the proposed theory quantifies the knowledge that can be leveraged by the LLMs to the prompt tasks, and leads to a similarity measure between the prompt demonstrations and the query input, which provides important insights and guidance for model pre-training and prompt engineering in ICL. Moreover, the impact of the prompt demonstration size and the dimension of the LLM embeddings in ICL are also explored based on the proposed theory. Finally, several real-data experiments are conducted to validate the practical usefulness of CB-ICL and the corresponding theory. With the great successes of large language models (LLMs), In-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP) (Brown et al., 2020; Chowdhery et al., 2023; Achiam et al., 2023), where LLMs addresses the requested queries in context prompts with a few demonstrations.



On the Interplay between Graph Structure and Learning Algorithms in Graph Neural Networks

Su, Junwei, Wu, Chuan

arXiv.org Artificial Intelligence

This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excessive risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, \emph{collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.}



Reweighting Improves Conditional Risk Bounds

Zhang, Yikai, Lin, Jiahe, Li, Fengpei, Zheng, Songzhu, Raj, Anant, Schneider, Anderson, Nevmyvaka, Yuriy

arXiv.org Machine Learning

In this work, we study the weighted empirical risk minimization (weighted ERM) schema, in which an additional data-dependent weight function is incorporated when the empirical risk function is being minimized. We show that under a general ``balanceable" Bernstein condition, one can design a weighted ERM estimator to achieve superior performance in certain sub-regions over the one obtained from standard ERM, and the superiority manifests itself through a data-dependent constant term in the error bound. These sub-regions correspond to large-margin ones in classification settings and low-variance ones in heteroscedastic regression settings, respectively. Our findings are supported by evidence from synthetic data experiments.


Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems

Su, Junwei, Zou, Difan, Wu, Chuan

arXiv.org Artificial Intelligence

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet, the extent to which preconditioning can enhance the generalization performance of SGD and whether it can bridge the existing gap with ridge regression remains uncertain. In this paper, we study the generalization performance of SGD with preconditioning for the least squared problem. We make a comprehensive comparison between preconditioned SGD and (standard \& preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditions matrix. Second, leveraging the excessive risk characterization of preconditioned SGD and ridge regression, we show that (through construction) there exists a simple preconditioned matrix that can make SGD comparable to (standard \& preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is straightforward enough to allow robust estimation from finite samples while maintaining a theoretical improvement. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.


On the Fairness Impacts of Private Ensembles Models

Tran, Cuong, Fioretto, Ferdinando

arXiv.org Artificial Intelligence

The Private Aggregation of Teacher Ensembles (PATE) is a machine learning framework that enables the creation of private models through the combination of multiple "teacher" models and a "student" model. The student model learns to predict an output based on the voting of the teachers, and the resulting model satisfies differential privacy. PATE has been shown to be effective in creating private models in semi-supervised settings or when protecting data labels is a priority. This paper explores whether the use of PATE can result in unfairness, and demonstrates that it can lead to accuracy disparities among groups of individuals. The paper also analyzes the algorithmic and data properties that contribute to these disproportionate impacts, why these aspects are affecting different groups disproportionately, and offers recommendations for mitigating these effects


Disparate Impact in Differential Privacy from Gradient Misalignment

Esipova, Maria S., Ghomi, Atiyeh Ashari, Luo, Yaqiao, Cresswell, Jesse C.

arXiv.org Artificial Intelligence

As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD. The increasingly widespread use of machine learning throughout society has brought into focus social, ethical, and legal considerations surrounding its use. In highly regulated industries, such as healthcare and banking, regional laws and regulations require data collection and analysis to respect the privacy of individuals. Other regulations focus on the fairness of how models are developed and used. As machine learning is progressively adopted in highly regulated industries, the privacy and fairness aspects of models must be considered at all stages of the modelling lifecycle. There are many privacy enhancing technologies including differential privacy (Dwork et al., 2006), federated learning (McMahan et al., 2017), secure multiparty computation (Yao, 1986), and homomorphic encryption (Gentry, 2009) that are used separately or jointly to protect the privacy of individuals whose data is used for machine learning (Choquette-Choo et al., 2020; Adnan et al., 2022; Kalra et al., 2021).


A Fairness Analysis on Private Aggregation of Teacher Ensembles

Tran, Cuong, Dinh, My H., Beiter, Kyle, Fioretto, Ferdinando

arXiv.org Artificial Intelligence

The Private Aggregation of Teacher Ensembles (PATE) is an important private machine learning framework. It combines multiple learning models used as teachers for a student model that learns to predict an output chosen by noisy voting among the teachers. The resulting model satisfies differential privacy and has been shown effective in learning high-quality private models in semisupervised settings or when one wishes to protect the data labels. This paper asks whether this privacy-preserving framework introduces or exacerbates bias and unfairness and shows that PATE can introduce accuracy disparity among individuals and groups of individuals. The paper analyzes which algorithmic and data properties are responsible for the disproportionate impacts, why these aspects are affecting different groups disproportionately, and proposes guidelines to mitigate these effects. The proposed approach is evaluated on several datasets and settings.